[PyTorch debug] Fix issue with tp_group=None#2733

Open
pggPL wants to merge 1 commit into NVIDIA:main from pggPL:fix_debug_hang

Conversation

pggPL (Collaborator) commented Mar 4, 2026

Description

If no TP is used, then tp_size=1 and tp_group=None inside TE modules.
In the debug feature utility get_reduction_params(), when TEDebugState.weight_tensor_tp_group_reduce=True, we previously set reduction_group = tp_group unconditionally. With tp_group=None, downstream distributed collectives interpret this as the default/world group, which can cause unintended cross-rank reduction for weight stats.

This PR fixes that behavior by making TP-group reduction explicit and safe:

  • if weight_tensor_tp_group_reduce=True but tp_group is None, skip reduction for weight stats;
  • if weight_tensor_tp_group_reduce=True and tp_group is available, reduce in that TP group;
  • existing behavior for other paths remains unchanged.

This prevents accidental world-group reductions when TP is effectively disabled (tp_size=1 / tp_group=None) while preserving the intended semantics of "reduce only within TP group" for weight tensors.
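The guarded logic described above can be sketched as follows. This is a hypothetical, simplified reconstruction based on the PR description and the flowchart below, not the exact TransformerEngine source; the function name matches the utility being fixed, but the parameter list and return tuple are assumptions.

```python
# Hypothetical sketch of the fixed get_reduction_params() logic.
# Assumption: the function returns (skip_reduction, reduction_group,
# reduce_within_microbatch), matching the API contract described in the PR.
def get_reduction_params(tensor_name, tp_group, weight_tensor_tp_group_reduce=True):
    skip_reduction = False
    reduction_group = None  # None means "default/world group" to torch.distributed
    reduce_within_microbatch = True

    if tensor_name == "weight":
        reduce_within_microbatch = False
        if weight_tensor_tp_group_reduce and tp_group is not None:
            # Reduce only within the explicit TP group.
            reduction_group = tp_group
        else:
            # tp_group=None (tp_size=1): skip reduction entirely rather than
            # falling through to the default world group.
            skip_reduction = True

    return skip_reduction, reduction_group, reduce_within_microbatch
```

With `tp_group=None`, the weight path now returns `skip_reduction=True` instead of handing `None` to a collective.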

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL (Collaborator, Author) commented Mar 4, 2026

/te-ci pytorch L1

greptile-apps bot (Contributor) commented Mar 4, 2026

Greptile Summary

This PR fixes a subtle bug in get_reduction_params() within the debug utilities where passing tp_group=None (the case when TP is disabled, i.e. tp_size=1) would inadvertently assign None to reduction_group. Since group=None in PyTorch distributed collectives resolves to the default world process group, this caused unintended cross-rank reduction of weight tensor statistics across all ranks rather than restricting it to the intended TP group.

Key changes:

  • In transformer_engine/debug/features/utils/__init__.py, the get_reduction_params() function now guards the reduction_group = tp_group assignment behind a tp_group is not None check (line 26).
  • When weight_tensor_tp_group_reduce=True but tp_group is None, skip_reduction is set to True (line 30), preventing the accidental world-group reduction entirely.
  • A clear inline comment (lines 24-25) explains why None must not be assigned to reduction_group.
  • All callers (log_tensor_stats.py, log_fp8_tensor_stats.py, log_nvfp4_tensor_stats.py) remain unaffected as the API contract (skip_reduction, reduction_group, reduce_within_microbatch) is unchanged.
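The unchanged API contract matters because the callers branch on `skip_reduction` before ever touching `reduction_group`. A hypothetical caller-side sketch (with `all_reduce_stub` standing in for `torch.distributed.all_reduce`; the helper names here are illustrative, not from the TE source):

```python
# Hypothetical consumer of the (skip_reduction, reduction_group, ...) contract.
def maybe_reduce(stats, skip_reduction, reduction_group, all_reduce):
    if skip_reduction:
        return stats  # no collective is issued at all
    # Only reached with an explicit group; passing group=None to a real
    # collective would select the default (world) process group.
    return all_reduce(stats, group=reduction_group)

calls = []  # record which groups collectives were issued on
def all_reduce_stub(stats, group):
    calls.append(group)
    return stats

# With the fix: tp_group=None yields skip_reduction=True, so no collective runs.
maybe_reduce([1.0], skip_reduction=True, reduction_group=None, all_reduce=all_reduce_stub)
# With a real TP group: the collective runs on exactly that group.
maybe_reduce([1.0], skip_reduction=False, reduction_group="tp_group_handle",
             all_reduce=all_reduce_stub)
```

Because the skip happens before the collective call, no rank ever enters a world-group reduction it shouldn't, which is what previously caused the hang referenced in the branch name (`fix_debug_hang`).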

Confidence Score: 5/5

  • This PR is safe to merge; it is a minimal, targeted bug fix that prevents unintended world-group reductions when TP is disabled.
  • The fix is a single, well-scoped guard check that prevents a concrete bug (world-group reduction when tp_group=None) without altering any existing code paths. The logic is straightforward, inline comments clearly explain the intent, and the change is localized to a utility function with a stable API contract.
  • No files require special attention.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["get_reduction_params(tensor_name, tp_group)"] --> B{tensor_name == 'weight'?}
    B -- No --> C["skip_reduction=False<br/>reduction_group=default<br/>reduce_within_microbatch=True"]
    B -- Yes --> D{weight_tensor_tp_group_reduce?}
    D -- False --> E["skip_reduction=True<br/>reduce_within_microbatch=False"]
    D -- True --> F{tp_group is not None?}
    F -- Yes --> G["reduction_group=tp_group<br/>skip_reduction=False<br/>reduce_within_microbatch=False"]
    F -- No --> H["skip_reduction=True ✅ FIXED<br/>reduce_within_microbatch=False<br/>(was: reduction_group=None → world group 🐛)"]
```

Last reviewed commit: c2adcaf
